convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization#20539
Conversation
> Please check.

> Yeah, something is off. I didn't properly smoke-test due to lack of memory.
@vbooka1 @richarddd Fixed by #20506 |
Force-pushed from 3530623 to 585e8da
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (ggml-org#20539)

* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization
* cleanup
* fallback

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Adds support for converting mixed-precision ModelOpt models (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) whose quantization config specifies a per-tensor quant_algo mixing NVFP4 and FP8 layers, rather than a single global quant_algo: "NVFP4". NVFP4 tensors (which carry 2D block scales) are repacked natively, while FP8 tensors (which carry 1D scales) are dequantized to float.
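The per-tensor dispatch described above can be sketched roughly as follows. This is a hypothetical illustration, not llama.cpp's actual converter code: the names `dispatch`, `quant_map`, and `dequant_fp8` are made up for this sketch, and the FP8 weights are assumed to already be upcast to a float array with a 1D per-row scale.

```python
import numpy as np

def dispatch(name: str, quant_map: dict) -> str:
    """Pick a conversion path per tensor instead of one global quant_algo.

    `quant_map` maps tensor names to their quant_algo ("NVFP4", "FP8", ...),
    standing in for a per-tensor quantization config.
    """
    algo = quant_map.get(name, "NONE")
    if algo == "NVFP4":
        return "repack"       # 2D block scales: keep quantized, repack natively
    if algo == "FP8":
        return "dequantize"   # 1D scales: expand back to float on conversion
    return "passthrough"      # unquantized tensors are copied as-is

def dequant_fp8(weight: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize an FP8 tensor (already upcast to float) with its 1D scale.

    Each row of the 2D weight is multiplied by its per-row scale factor.
    """
    return weight * scale[:, None]
```

A usage sketch: `dispatch("blk.0.ffn_up.weight", quant_map)` would return `"repack"` for an NVFP4 layer and `"dequantize"` for an FP8 one, mirroring the fallback behavior the PR adds.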
Fixes: #20504